Before this course

  • Questions are always welcome. If anything is unclear during the course, feel free to ask me on Facebook.

About R

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R. …… R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

R 4.5.1 was released in June 2025.


How/where to obtain and install R?

If Linux is your default OS, you can install R directly from your distribution's package repositories. Alternatively, if you use Windows or macOS, you can download an R binary or the source code from CRAN.

Ubuntu users

  • Update the package indices with sudo apt update -qq
  • Install the two helper packages we need with sudo apt install --no-install-recommends software-properties-common dirmngr
  • Add the signing key with wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
  • Add the repo from CRAN with sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"
  • Install R with sudo apt install --no-install-recommends r-base r-base-dev

MacOS users

  • Intel x86-64: R-4.5.1-x86_64.pkg
  • Apple silicon arm64: R-4.5.1-arm64.pkg

Windows users

  • Download the latest version of R from CRAN
  • Install R with the downloaded installer
  • Install Rtools from CRAN
    • Rtools is a collection of tools for building R packages on Windows. It includes a compiler, a set of libraries, and other tools that are needed to build R packages from source.
  • Add Rtools to your PATH environment variable
    • Open the Control Panel and go to System and Security > System > Advanced system settings > Environment Variables.
    • Under System variables, find the PATH variable and click Edit.
    • Add the path to the Rtools bin directory (e.g., C:\Rtools\bin) to the PATH variable.
    • Click OK to save the changes.

CRAN Repositories


Using RStudio as your default R-programming IDE

About RStudio

RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.

Install the most suitable version of RStudio for your needs.

  • Desktop version: Access RStudio locally.
  • Server version: Access via a web browser.

Other choices

  • VS Code
  • Vim + Nvim-R
  • Any other text editor: gedit, Emacs (+ESS), Eclipse, etc.

The most important step when beginning to learn R is using help()

help() & help.search()

help(help)
help.search("standard deviation")

? & ??

?mean
??hypergeometric
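
Besides help() and ?, a few other built-in lookup tools are useful when you only half remember a function. The sketch below uses only functions that ship with R; sd, mean and the "^read\\." pattern are arbitrary choices for illustration.

```r
# Show the argument list of a function without opening its help page
args(sd)

# List objects on the search path whose names start with "read."
read_funs <- apropos("^read\\.")
print(read_funs)

# Run the examples section of a help page (here: mean)
example(mean)
```

apropos() takes a regular expression, so it complements help.search(), which searches the help text rather than object names.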

Package installation & PATH setting

Installing packages in R console

# Download packages from a CRAN repository and install them
install.packages('rmarkdown',                         # Package name
                 repos="http://cran.csie.ntu.edu.tw", # The URL of the CRAN repository
                 destdir="~/Download",                # The directory where downloaded packages are stored
                 lib=.libPaths()[1])                  # The directory in which to install packages

# Install Pkgs from downloaded source code
install.packages('~/Download/rmarkdown_0.5.1.tar.gz',
                 repos=NULL,
                 type="source",
                 lib=.libPaths()[1])
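
In scripts it is often convenient to install a package only when it is not already available. The following is a minimal sketch of that idea; install_if_missing is a hypothetical helper name (not part of R), and the default repository URL is an assumption you may want to replace with your preferred mirror.

```r
# Hypothetical helper: install a CRAN package only when it is missing
install_if_missing <- function(pkg, repos = "https://cloud.r-project.org") {
    if (!requireNamespace(pkg, quietly = TRUE)) {
        install.packages(pkg, repos = repos)
    }
    # TRUE if the package can be loaded afterwards
    invisible(requireNamespace(pkg, quietly = TRUE))
}

# 'stats' ships with R, so this call downloads nothing
install_if_missing("stats")
```

requireNamespace() checks for the package without attaching it, which keeps the search path unchanged.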

Installing packages in terminal

$ R CMD INSTALL -l $HOME/R/4.1 rmarkdown_0.5.1.tar.gz

Setting the library path

.libPaths(new)  # e.g. .libPaths("/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library")

Some packages must be downloaded and installed from R-Forge

Use install.packages(Pkg, repos='http://R-Forge.R-project.org')

Using the package installed

library(Pkg)
require(Pkg) # Avoid using this!

What is the difference between require() and library()?
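
The short answer is that library() stops with an error when a package is missing, whereas require() only warns and returns FALSE, so a script can silently continue without the package. A small sketch of that behaviour follows; "notARealPkg123" is a made-up package name that is assumed not to be installed.

```r
# Made-up package name, assumed not to be installed
missing_pkg <- "notARealPkg123"

# library() signals an error when the package cannot be found
lib_result <- tryCatch(
    library(missing_pkg, character.only = TRUE),
    error = function(e) "error"
)
print(lib_result)

# require() only warns and returns FALSE
req_result <- suppressWarnings(
    require(missing_pkg, character.only = TRUE, quietly = TRUE)
)
print(req_result)
```

Failing early with library() usually gives a clearer error message than a later "could not find function" failure.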


Bioconductor

About Bioconductor

Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development.

Install Pkgs from Bioconductor

# Install BiocManager
install.packages("BiocManager")
BiocManager::install(pkgname)

Rich course materials

Courses & conference


Basic operation

5+5
5-3
5*3
5/3
5^3
10%%3

# Variable assignment
x <- 5 # '<-' is the assignment operator in R; at the top level it behaves like '='
y <- function(i) mean(i)

Data and object types

Data types

  • numeric: c(1:3, 5 ,7)
  • character: c("1","2","3"); LETTERS[1:3]
  • logical: TRUE; FALSE
  • complex: 1+2i; complex(real=1, imaginary=3)
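
To check which data type a value has, class() and typeof() can be used. The values below are just small illustrations of the four types listed above.

```r
class(c(1:3, 5, 7))    # "numeric"
class(LETTERS[1:3])    # "character"
class(TRUE)            # "logical"
class(1 + 2i)          # "complex"

# typeof() reports the internal storage type
typeof(1:3)            # "integer"
typeof(c(1:3, 5, 7))   # "double"
```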

Object types

  • vector: the data types of all elements in a vector must be consistent!
x <- 1:5
y <- c(6,7,8,9,10)
z <- x - y
print(z)
## [1] -5 -5 -5 -5 -5
# Vectorized code performs better!
a <- 1:100000
system.time(mean(a))
##    user  system elapsed 
##       0       0       0
total <- 0
system.time(for (i in a) {total <- total + i; total/100000})
##    user  system elapsed 
##   0.002   0.000   0.002
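
Because all elements of a vector must share one data type, R silently coerces mixed input to the most general type. A short sketch with made-up values:

```r
# Numbers and logicals are converted to character when a string is present
mixed <- c(1, "a", TRUE)
mixed          # "1" "a" "TRUE"
class(mixed)   # "character"

# Logicals become 1/0 when combined with numbers
nums <- c(1.5, TRUE, FALSE)
nums           # 1.5 1.0 0.0
```

If you need to keep different types side by side, use a list or a data frame instead.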
  • matrix
x <- matrix(rnorm(100), nrow=20, ncol=5)
print(x)
##              [,1]        [,2]        [,3]        [,4]        [,5]
##  [1,]  1.74163397 -0.03810272 -0.96120209 -0.24176098 -0.54965156
##  [2,] -0.63067205  1.08667561  1.42206483  0.29232286 -1.16397628
##  [3,] -0.46148956 -0.35106963 -0.21639020  0.35608953  0.34719150
##  [4,] -1.04802956  2.01831013  1.22964837  0.45379716 -0.11585321
##  [5,]  0.13195893 -1.28455967 -1.26447258 -0.10675446  0.36857838
##  [6,] -0.58721462  0.97248284  1.31319478  0.67057845 -2.41650094
##  [7,] -1.24989474  0.36190303  1.10358137 -1.00510130  0.69042402
##  [8,]  0.08537424 -0.14246972 -0.39495421  1.17932923  0.42306304
##  [9,] -0.78610579  0.25176024  1.53546452 -0.90384679  0.88699519
## [10,]  1.31152031 -0.70569802 -2.56955242 -0.21647034  2.36266403
## [11,] -0.88759818 -0.43149353  0.38448366  0.05873202 -0.71380946
## [12,] -0.61949104 -0.61605747 -0.76844046 -0.36242184 -0.42889161
## [13,] -0.34410229  0.70567089 -0.32650641 -0.41827765 -0.84913210
## [14,]  0.29294332  1.55538395  0.33381981 -0.54142571  1.84165214
## [15,]  0.28183061  0.78501610 -0.27829481  0.44594465  1.05169681
## [16,] -0.71475084 -1.28650292 -0.47671468  0.15007436 -2.26454406
## [17,]  0.46361439  0.36132323 -0.17796319  0.32068415  0.77806603
## [18,] -0.44370991  0.26336666 -0.72543640  1.03957070 -2.13608733
## [19,]  0.25944652  0.57629562  0.04361425 -0.63841780 -0.04623529
## [20,]  1.08022782  0.51257606 -0.32890740 -1.46289639 -0.91684247
x[1,3]
x[2:4,]
x[,3:5]
x %*% t(x)

# A matrix is a vector with subscripts!
x[1:3]
x[1:3,1]
  • array
y <- array(rnorm(64), c(8,4,2))
print(y) # An array is also a vector with subscripts!
## , , 1
## 
##             [,1]       [,2]        [,3]        [,4]
## [1,]  0.71682911  0.3388699 -0.38667789 -0.32485831
## [2,] -0.23540483  1.2188133 -0.28526021 -1.43122803
## [3,]  0.08132329  0.6019541  0.46479017 -0.30325276
## [4,] -0.73745649 -0.5169750 -1.47722329  0.36829250
## [5,]  0.71191898 -0.3167899 -0.02761896 -0.44708518
## [6,] -0.74837075 -1.3293984 -0.49748233 -0.02361309
## [7,] -0.85238593  1.0331304  1.16084408 -1.17082143
## [8,]  1.91537588  0.5788778  0.22488158 -0.26559034
## 
## , , 2
## 
##             [,1]       [,2]         [,3]        [,4]
## [1,] -1.20186813 -1.7899444  1.292272513 -0.88905775
## [2,] -1.99654690 -0.2733448 -2.217147089  0.25037685
## [3,]  1.40991818 -0.3525800  0.206633382 -0.39882941
## [4,] -0.36794142 -0.9704212 -1.850420672 -1.45267972
## [5,]  1.38158795 -0.1740100  0.522769625  1.70312843
## [6,]  1.64546427  1.0129868 -1.302574387  0.08521852
## [7,]  0.04183054 -0.9289487  3.251268832  0.69315357
## [8,]  0.29128281 -1.5919374  0.003467996 -0.81942408
  • list: the elements of a list can have different, even complex, data types
x <- list(1:5, c("a","b","c"), matrix(rnorm(10), nrow=5, ncol=2))
print(x)
## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
## [1] "a" "b" "c"
## 
## [[3]]
##             [,1]       [,2]
## [1,] -0.60090293  1.0370934
## [2,] -0.89184783 -0.8817037
## [3,] -0.99452267 -1.3293370
## [4,]  0.08283317 -0.5032278
## [5,]  0.91827243  1.2831393
x$mylist <- x
print(x)
## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
## [1] "a" "b" "c"
## 
## [[3]]
##             [,1]       [,2]
## [1,] -0.60090293  1.0370934
## [2,] -0.89184783 -0.8817037
## [3,] -0.99452267 -1.3293370
## [4,]  0.08283317 -0.5032278
## [5,]  0.91827243  1.2831393
## 
## $mylist
## $mylist[[1]]
## [1] 1 2 3 4 5
## 
## $mylist[[2]]
## [1] "a" "b" "c"
## 
## $mylist[[3]]
##             [,1]       [,2]
## [1,] -0.60090293  1.0370934
## [2,] -0.89184783 -0.8817037
## [3,] -0.99452267 -1.3293370
## [4,]  0.08283317 -0.5032278
## [5,]  0.91827243  1.2831393
  • data frame: a data frame is a list of vectors (columns) that all have the same length
df<-data.frame(num=1:10, 
           char=LETTERS[1:10], 
           logic=sample(c(TRUE,FALSE), 10, replace=TRUE))

df
##    num char logic
## 1    1    A  TRUE
## 2    2    B  TRUE
## 3    3    C FALSE
## 4    4    D FALSE
## 5    5    E FALSE
## 6    6    F FALSE
## 7    7    G FALSE
## 8    8    H  TRUE
## 9    9    I  TRUE
## 10  10    J  TRUE
df$char
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
df$logic[5:7]
## [1] FALSE FALSE FALSE
  • factor: An R factor might be viewed simply as a vector with a bit more information added (though, as seen below, it’s different from this internally). That extra information consists of a record of the distinct values in that vector, called levels.
x <- c(5, 12, 32, 12)
xf <- factor(x)
print(xf)
## [1] 5  12 32 12
## Levels: 5 12 32

So…. a factor looks like a vector, right?

str(xf) # Here str stands for structure. This function shows the internal structure of any R object.
##  Factor w/ 3 levels "5","12","32": 1 2 3 2
unclass(xf)
## [1] 1 2 3 2
## attr(,"levels")
## [1] "5"  "12" "32"
length(xf)
## [1] 4

What??? What are you talking about?

x <- c(5, 12, 13, 12)
xff <- factor(x, levels=c(5, 12, 13, 88))
xff
## [1] 5  12 13 12
## Levels: 5 12 13 88
xff[2] <- 88 
xff
## [1] 5  88 13 12
## Levels: 5 12 13 88
xff[2] <- 28 # You cannot sneak in an "illegal" level
## Warning in `[<-.factor`(`*tmp*`, 2, value = 28): invalid factor level, NA
## generated
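
One practical consequence of the internal integer codes is that as.numeric() on a factor returns the codes, not the original values. A small sketch, reusing the numbers from the first factor example (v and vf are illustrative names):

```r
v <- c(5, 12, 32, 12)
vf <- factor(v)

# Returns the internal level codes
as.numeric(vf)                 # 1 2 3 2

# Convert via character to recover the original numbers
as.numeric(as.character(vf))   # 5 12 32 12
```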
  • table: Another common way to store information is in a table.
# One way table
a <- factor(c("A","A","B","A","B","B","C","A","C"))
a
## [1] A A B A B B C A C
## Levels: A B C
a.table <- table(a)
a.table
## a
## A B C 
## 4 3 2
attributes(a.table)
## $dim
## [1] 3
## 
## $dimnames
## $dimnames$a
## [1] "A" "B" "C"
## 
## 
## $class
## [1] "table"
# Two way table
a <- c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes","Never")
b <- c("Maybe","Maybe","Yes","Maybe","Maybe","No","Yes","No")
twoway.table <- table(a,b)
twoway.table
##            b
## a           Maybe No Yes
##   Always        2  0   0
##   Never         0  1   1
##   Sometimes     2  1   1
# An example
sexsmoke<-matrix(c(70,120,65,140),ncol=2,byrow=TRUE)
rownames(sexsmoke)<-c("male","female")
colnames(sexsmoke)<-c("smoke","nosmoke")
sexsmoke <- as.table(sexsmoke)
sexsmoke
##        smoke nosmoke
## male      70     120
## female    65     140
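
Once data are stored as a table, totals and proportions are easy to derive. The sketch below rebuilds the sexsmoke table from the example and uses margin.table(), prop.table() and addmargins() from base R; the variable names row_totals, col_totals and row_props are illustrative.

```r
sexsmoke <- as.table(matrix(c(70, 120, 65, 140), ncol = 2, byrow = TRUE,
                            dimnames = list(c("male", "female"),
                                            c("smoke", "nosmoke"))))

row_totals <- margin.table(sexsmoke, 1)  # male 190, female 205
col_totals <- margin.table(sexsmoke, 2)  # smoke 135, nosmoke 260
row_props  <- prop.table(sexsmoke, 1)    # proportions within each row

# Append row and column totals to the table
addmargins(sexsmoke)
```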

Control structures

Conditional execution

  • equal: ==
  • not equal: !=
  • greater/less than: >, <
  • greater/less than or equal: >=, <=

Logical operators

  • and: &, &&
  • or: |, ||
  • not: !
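
The single and double forms differ: & and | work element by element on vectors, while && and || evaluate a single condition and stop as soon as the result is known. Recent R versions raise an error if && or || is given a vector longer than one. A short sketch with made-up values:

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

x & y    # TRUE FALSE FALSE
x | y    # TRUE TRUE TRUE
!x       # FALSE TRUE FALSE

# || short-circuits: the right-hand side is never evaluated here
z <- NULL
safe <- is.null(z) || z > 0
safe     # TRUE
```

Use && and || inside if () conditions, and & and | when filtering vectors or data frames.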

if-else statements

if (cond1==TRUE) {cmd1} else {cmd2}
# Example
if (1 == 0) {
    print(1)
} else {
    print(2)
}
## [1] 2

ifelse() function (vectorized ternary operator in R)

ifelse(test, true_value, false_value)
x <- 1:10
ifelse(x<5|x>8, x, 0)
##  [1]  1  2  3  4  0  0  0  0  9 10

switch-case statements

AA <- 'foo'
switch(AA,
       foo = {print('AA is foo')},
       bar = {print('AA is bar')},
       {print('Default')}
)
## [1] "AA is foo"
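
The unnamed last argument of switch() acts as the default branch. A minimal sketch that wraps the same pattern in a function (describe is a hypothetical name) and returns the message instead of printing it:

```r
describe <- function(AA) {
    switch(AA,
           foo = "AA is foo",
           bar = "AA is bar",
           "Default")   # used when no name matches
}

describe("foo")   # "AA is foo"
describe("baz")   # "Default"
```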

Loops

For loop

for (var in vector) {
    statement
}
# Example
mydf <- iris
head(mydf)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
myve <- NULL
for (i in 1:nrow(mydf)) {
    myve <- c(myve, mean(as.numeric(mydf[i, 1:3])))
}
myve
##   [1] 3.333333 3.100000 3.066667 3.066667 3.333333 3.666667 3.133333 3.300000
##   [9] 2.900000 3.166667 3.533333 3.266667 3.066667 2.800000 3.666667 3.866667
##  [17] 3.533333 3.333333 3.733333 3.466667 3.500000 3.433333 3.066667 3.366667
##  [25] 3.366667 3.200000 3.333333 3.400000 3.333333 3.166667 3.166667 3.433333
##  [33] 3.600000 3.700000 3.166667 3.133333 3.433333 3.300000 2.900000 3.333333
##  [41] 3.266667 2.700000 2.966667 3.366667 3.600000 3.066667 3.500000 3.066667
##  [49] 3.500000 3.233333 4.966667 4.700000 4.966667 3.933333 4.633333 4.333333
##  [57] 4.766667 3.533333 4.700000 3.933333 3.500000 4.366667 4.066667 4.566667
##  [65] 4.033333 4.733333 4.366667 4.200000 4.300000 4.000000 4.633333 4.300000
##  [73] 4.566667 4.533333 4.533333 4.666667 4.800000 4.900000 4.466667 3.933333
##  [81] 3.900000 3.866667 4.133333 4.600000 4.300000 4.633333 4.833333 4.333333
##  [89] 4.233333 4.000000 4.166667 4.566667 4.133333 3.533333 4.166667 4.300000
##  [97] 4.266667 4.466667 3.533333 4.200000 5.200000 4.533333 5.333333 4.933333
## [105] 5.100000 5.733333 3.966667 5.500000 5.000000 5.633333 4.933333 4.800000
## [113] 5.100000 4.400000 4.566667 4.966667 5.000000 6.066667 5.733333 4.400000
## [121] 5.266667 4.433333 5.733333 4.633333 5.233333 5.466667 4.600000 4.666667
## [129] 4.933333 5.333333 5.433333 6.033333 4.933333 4.733333 4.766667 5.600000
## [137] 5.100000 5.000000 4.600000 5.133333 5.133333 5.033333 4.533333 5.300000
## [145] 5.233333 4.966667 4.600000 4.900000 5.000000 4.666667
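
Growing a vector with c() inside a loop copies it on every iteration. The sketch below shows two common alternatives for the same row means: a loop that fills a preallocated vector, and the vectorized rowMeans(). The names loop_means and vec_means are illustrative.

```r
mydf <- iris

# Loop version with a preallocated result vector
loop_means <- numeric(nrow(mydf))
for (i in seq_len(nrow(mydf))) {
    loop_means[i] <- mean(as.numeric(mydf[i, 1:3]))
}

# Vectorized version: one call, no explicit loop
vec_means <- rowMeans(mydf[, 1:3])

# Both approaches give the same numbers
all.equal(loop_means, unname(vec_means))
```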

while loop

while (condition) statements
# Example
z <- 0
while (z < 5) {
    z <- z + 2
    print(z)
}
## [1] 2
## [1] 4
## [1] 6

apply loop

For matrix/array
apply(X, MARGIN, FUN, ...)  # MARGIN = 1 applies FUN over rows, MARGIN = 2 over columns

# Examples
apply(iris[,1:3], 1, mean)

x <- 1:10

apply(as.matrix(x), 1, function(i) {
    if (i < 5) 
        i - 1 
    else 
        i/i
})
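
The MARGIN argument is easiest to see on a small matrix. This sketch uses a made-up 2 x 3 matrix m:

```r
m <- matrix(1:6, nrow = 2)   # columns are (1,2), (3,4), (5,6)

row_sums <- apply(m, 1, sum) # one value per row:    9 12
col_sums <- apply(m, 2, sum) # one value per column: 3 7 11
```

For plain sums and means, rowSums(), colSums(), rowMeans() and colMeans() do the same job and are usually faster.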
For vector/list
lapply(X, FUN)
sapply(X, FUN)
# Examples
mylist <- as.list(iris[1:3, 1:3])
mylist
## $Sepal.Length
## [1] 5.1 4.9 4.7
## 
## $Sepal.Width
## [1] 3.5 3.0 3.2
## 
## $Petal.Length
## [1] 1.4 1.4 1.3
lapply(mylist, sum) # Compute sum of each list component and return result as list
## $Sepal.Length
## [1] 14.7
## 
## $Sepal.Width
## [1] 9.7
## 
## $Petal.Length
## [1] 4.1
sapply(mylist, sum) # Compute sum of each list component and return result as vector
## Sepal.Length  Sepal.Width Petal.Length 
##         14.7          9.7          4.1
More apply functions
  • tapply
  • mapply
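
As a brief sketch of these two: tapply() applies a function to groups of a vector defined by a factor, and mapply() applies a function to several vectors in parallel. The built-in iris data and the names species_means and pair_sums are used for illustration only.

```r
# Mean sepal length per species
species_means <- tapply(iris$Sepal.Length, iris$Species, mean)
species_means

# Element-wise sum of two vectors: 1+4, 2+5, 3+6
pair_sums <- mapply(function(a, b) a + b, 1:3, 4:6)
pair_sums   # 5 7 9
```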

function

FunctionName <- function(arg1, arg2, ...) { 
    statements
    return(R_object)
}
add <- function(a, b) {
    c <- a + b
    return(c)
}
x <- 5
y <- 7
z <- add(x,y)
z
## [1] 12
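
Arguments can have default values, and a function returns the value of its last expression when return() is omitted. A minimal sketch (power is a hypothetical function name):

```r
power <- function(base, exponent = 2) {
    base^exponent   # the last expression is the return value
}

power(3)                  # 9, uses the default exponent
power(2, exponent = 10)   # 1024
```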

Advanced R programming

Garbage collection

  • rm()
  • gc()
x <- as.matrix(read.table("test.csv", sep="\t")) # x is a 4500000 x 220 matrix
y <- apply(x, 1, mean)
rm(list=c("x","y"))
gc()
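
Since test.csv is not available here, the same pattern can be tried on a small in-memory object. This is only a sketch; big is an illustrative name, and the exact memory figures reported depend on your system.

```r
big <- numeric(1e6)                     # one million doubles
print(object.size(big), units = "Mb")   # memory used by the object

rm(big)           # remove the name so the memory becomes collectable
invisible(gc())   # trigger a garbage collection now

exists("big")     # FALSE
```

R runs the garbage collector automatically, so calling gc() by hand is mainly useful after removing very large objects.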

Use data.table to speed up data access

See Introduction to the data.table package in R

Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread). Offers a natural and flexible syntax, for faster development. - from CRAN

library(data.table)
grpsize <- ceiling(1e7/26^2)
DF <- data.frame(
    x=rep(LETTERS, each=26*grpsize),
    y=rep(letters, each=grpsize),
    v=runif(grpsize*26^2),
    stringsAsFactors=FALSE)
system.time(ans1 <- DF[DF$x=="R" & DF$y=="h",])
##    user  system elapsed 
##   0.057   0.010   0.067
DT <- as.data.table(DF)
setkey(DT, x, y)
system.time(ans2 <- DT[list("R","h")])
##    user  system elapsed 
##   0.013   0.001   0.004

Tidyverse

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying philosophy and common APIs.

Hadley Wickham
install.packages("tidyverse")
  • magrittr: A Forward-Pipe Operator for R

Use this equation as an example:

\[ \log\left(\sum_{i=1}^{n}\exp(x_i)\right) \]

In base R, you would compute this expression by nesting function calls like this:

log(sum(exp(MyData)), exp(1))

With magrittr, you can calculate the equation like this:

MyData %>% exp %>% sum %>% log(exp(1))
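
The same chain can also be written without any package, using the native pipe |> that base R provides from version 4.1 onwards. This is a sketch under that version assumption; the values in MyData are made up for illustration.

```r
MyData <- c(1, 2, 3)   # illustrative data

# Nested calls, read from the inside out
nested <- log(sum(exp(MyData)))

# Native pipe, read from left to right
piped <- MyData |> exp() |> sum() |> log()

all.equal(nested, piped)
```

Unlike %>%, the native pipe requires the parentheses after each function name.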
  • plyr

“plyr is a set of tools that solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece and then put all the pieces back together. It’s already possible to do this with split and the apply functions, but plyr just makes it all a bit easier...”

set.seed(1)
# 18 random counts between 0 and 20 (three per year)
d <- data.frame(year = rep(2000:2005, each=3),
                count = round(runif(18, 0, 20))
                )

print(d)
##    year count
## 1  2000     5
## 2  2000     7
## 3  2000    11
## 4  2001    18
## 5  2001     4
## 6  2001    18
## 7  2002    19
## 8  2002    13
## 9  2002    13
## 10 2003     1
## 11 2003     4
## 12 2003     4
## 13 2004    14
## 14 2004     8
## 15 2004    15
## 16 2005    10
## 17 2005    14
## 18 2005    20
library(plyr)
# Coefficient of variation (sd / mean) of count within each year
ddply(d, "year", function(x) {
    mean.count <- mean(x$count)
    sd.count <- sd(x$count)
    cv <- sd.count/mean.count
    data.frame(cv.count=cv)
})
##   year  cv.count
## 1 2000 0.3984848
## 2 2001 0.6062178
## 3 2002 0.2309401
## 4 2003 0.5773503
## 5 2004 0.3069680
## 6 2005 0.3431743
  • dplyr: a package for data manipulation, written and maintained by Hadley Wickham. It provides easy-to-use functions that are very handy for exploratory data analysis and manipulation.

    • filter(): returns all rows that satisfy the given condition.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Let's start with a dataset about air quality
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
# Keep the records with Temp > 70
filter(airquality, Temp > 70)
##     Ozone Solar.R Wind Temp Month Day
## 1      36     118  8.0   72     5   2
## 2      12     149 12.6   74     5   3
## 3       7      NA  6.9   74     5  11
## 4      11     320 16.6   73     5  22
## 5      45     252 14.9   81     5  29
## 6     115     223  5.7   79     5  30
## 7      37     279  7.4   76     5  31
## 8      NA     286  8.6   78     6   1
## 9      NA     287  9.7   74     6   2
## 10     NA     186  9.2   84     6   4
## 11     NA     220  8.6   85     6   5
## 12     NA     264 14.3   79     6   6
## 13     29     127  9.7   82     6   7
## 14     NA     273  6.9   87     6   8
## 15     71     291 13.8   90     6   9
## 16     39     323 11.5   87     6  10
## 17     NA     259 10.9   93     6  11
## 18     NA     250  9.2   92     6  12
## 19     23     148  8.0   82     6  13
## 20     NA     332 13.8   80     6  14
## 21     NA     322 11.5   79     6  15
## 22     21     191 14.9   77     6  16
## 23     37     284 20.7   72     6  17
## 24     12     120 11.5   73     6  19
## 25     13     137 10.3   76     6  20
## 26     NA     150  6.3   77     6  21
## 27     NA      59  1.7   76     6  22
## 28     NA      91  4.6   76     6  23
## 29     NA     250  6.3   76     6  24
## 30     NA     135  8.0   75     6  25
## 31     NA     127  8.0   78     6  26
## 32     NA      47 10.3   73     6  27
## 33     NA      98 11.5   80     6  28
## 34     NA      31 14.9   77     6  29
## 35     NA     138  8.0   83     6  30
## 36    135     269  4.1   84     7   1
## 37     49     248  9.2   85     7   2
## 38     32     236  9.2   81     7   3
## 39     NA     101 10.9   84     7   4
## 40     64     175  4.6   83     7   5
## 41     40     314 10.9   83     7   6
## 42     77     276  5.1   88     7   7
## 43     97     267  6.3   92     7   8
## 44     97     272  5.7   92     7   9
## 45     85     175  7.4   89     7  10
## 46     NA     139  8.6   82     7  11
## 47     10     264 14.3   73     7  12
## 48     27     175 14.9   81     7  13
## 49     NA     291 14.9   91     7  14
## 50      7      48 14.3   80     7  15
## 51     48     260  6.9   81     7  16
## 52     35     274 10.3   82     7  17
## 53     61     285  6.3   84     7  18
## 54     79     187  5.1   87     7  19
## 55     63     220 11.5   85     7  20
## 56     16       7  6.9   74     7  21
## 57     NA     258  9.7   81     7  22
## 58     NA     295 11.5   82     7  23
## 59     80     294  8.6   86     7  24
## 60    108     223  8.0   85     7  25
## 61     20      81  8.6   82     7  26
## 62     52      82 12.0   86     7  27
## 63     82     213  7.4   88     7  28
## 64     50     275  7.4   86     7  29
## 65     64     253  7.4   83     7  30
## 66     59     254  9.2   81     7  31
## 67     39      83  6.9   81     8   1
## 68      9      24 13.8   81     8   2
## 69     16      77  7.4   82     8   3
## 70     78      NA  6.9   86     8   4
## 71     35      NA  7.4   85     8   5
## 72     66      NA  4.6   87     8   6
## 73    122     255  4.0   89     8   7
## 74     89     229 10.3   90     8   8
## 75    110     207  8.0   90     8   9
## 76     NA     222  8.6   92     8  10
## 77     NA     137 11.5   86     8  11
## 78     44     192 11.5   86     8  12
## 79     28     273 11.5   82     8  13
## 80     65     157  9.7   80     8  14
## 81     NA      64 11.5   79     8  15
## 82     22      71 10.3   77     8  16
## 83     59      51  6.3   79     8  17
## 84     23     115  7.4   76     8  18
## 85     31     244 10.9   78     8  19
## 86     44     190 10.3   78     8  20
## 87     21     259 15.5   77     8  21
## 88      9      36 14.3   72     8  22
## 89     NA     255 12.6   75     8  23
## 90     45     212  9.7   79     8  24
## 91    168     238  3.4   81     8  25
## 92     73     215  8.0   86     8  26
## 93     NA     153  5.7   88     8  27
## 94     76     203  9.7   97     8  28
## 95    118     225  2.3   94     8  29
## 96     84     237  6.3   96     8  30
## 97     85     188  6.3   94     8  31
## 98     96     167  6.9   91     9   1
## 99     78     197  5.1   92     9   2
## 100    73     183  2.8   93     9   3
## 101    91     189  4.6   93     9   4
## 102    47      95  7.4   87     9   5
## 103    32      92 15.5   84     9   6
## 104    20     252 10.9   80     9   7
## 105    23     220 10.3   78     9   8
## 106    21     230 10.9   75     9   9
## 107    24     259  9.7   73     9  10
## 108    44     236 14.9   81     9  11
## 109    21     259 15.5   76     9  12
## 110    28     238  6.3   77     9  13
## 111     9      24 10.9   71     9  14
## 112    13     112 11.5   71     9  15
## 113    46     237  6.9   78     9  16
## 114    13      27 10.3   76     9  18
## 115    16     201  8.0   82     9  20
## 116    23      14  9.2   71     9  22
## 117    36     139 10.3   81     9  23
## 118    NA     145 13.2   77     9  27
## 119    14     191 14.3   75     9  28
## 120    18     131  8.0   76     9  29
# Keep the records with Temp > 80 and Month after May
filter(airquality, Temp > 80 & Month > 5)
##    Ozone Solar.R Wind Temp Month Day
## 1     NA     186  9.2   84     6   4
## 2     NA     220  8.6   85     6   5
## 3     29     127  9.7   82     6   7
## 4     NA     273  6.9   87     6   8
## 5     71     291 13.8   90     6   9
## 6     39     323 11.5   87     6  10
## 7     NA     259 10.9   93     6  11
## 8     NA     250  9.2   92     6  12
## 9     23     148  8.0   82     6  13
## 10    NA     138  8.0   83     6  30
## 11   135     269  4.1   84     7   1
## 12    49     248  9.2   85     7   2
## 13    32     236  9.2   81     7   3
## 14    NA     101 10.9   84     7   4
## 15    64     175  4.6   83     7   5
## 16    40     314 10.9   83     7   6
## 17    77     276  5.1   88     7   7
## 18    97     267  6.3   92     7   8
## 19    97     272  5.7   92     7   9
## 20    85     175  7.4   89     7  10
## 21    NA     139  8.6   82     7  11
## 22    27     175 14.9   81     7  13
## 23    NA     291 14.9   91     7  14
## 24    48     260  6.9   81     7  16
## 25    35     274 10.3   82     7  17
## 26    61     285  6.3   84     7  18
## 27    79     187  5.1   87     7  19
## 28    63     220 11.5   85     7  20
## 29    NA     258  9.7   81     7  22
## 30    NA     295 11.5   82     7  23
## 31    80     294  8.6   86     7  24
## 32   108     223  8.0   85     7  25
## 33    20      81  8.6   82     7  26
## 34    52      82 12.0   86     7  27
## 35    82     213  7.4   88     7  28
## 36    50     275  7.4   86     7  29
## 37    64     253  7.4   83     7  30
## 38    59     254  9.2   81     7  31
## 39    39      83  6.9   81     8   1
## 40     9      24 13.8   81     8   2
## 41    16      77  7.4   82     8   3
## 42    78      NA  6.9   86     8   4
## 43    35      NA  7.4   85     8   5
## 44    66      NA  4.6   87     8   6
## 45   122     255  4.0   89     8   7
## 46    89     229 10.3   90     8   8
## 47   110     207  8.0   90     8   9
## 48    NA     222  8.6   92     8  10
## 49    NA     137 11.5   86     8  11
## 50    44     192 11.5   86     8  12
## 51    28     273 11.5   82     8  13
## 52   168     238  3.4   81     8  25
## 53    73     215  8.0   86     8  26
## 54    NA     153  5.7   88     8  27
## 55    76     203  9.7   97     8  28
## 56   118     225  2.3   94     8  29
## 57    84     237  6.3   96     8  30
## 58    85     188  6.3   94     8  31
## 59    96     167  6.9   91     9   1
## 60    78     197  5.1   92     9   2
## 61    73     183  2.8   93     9   3
## 62    91     189  4.6   93     9   4
## 63    47      95  7.4   87     9   5
## 64    32      92 15.5   84     9   6
## 65    44     236 14.9   81     9  11
## 66    16     201  8.0   82     9  20
## 67    36     139 10.3   81     9  23
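
For comparison, the same row selection can be sketched in base R without dplyr, using subset() or logical indexing. The names hot_late and hot_late2 are illustrative; note that base R keeps the original row names instead of renumbering the rows.

```r
# subset() evaluates the condition inside the data frame
hot_late <- subset(airquality, Temp > 80 & Month > 5)
nrow(hot_late)   # 67 rows, the same count as the filter() output

# Equivalent logical indexing
hot_late2 <- airquality[airquality$Temp > 80 & airquality$Month > 5, ]
nrow(hot_late2)
```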
  • mutate(): the function is used to add new variables to the data.
mutate(airquality, TempInC = (Temp - 32) * 5 / 9)
##     Ozone Solar.R Wind Temp Month Day  TempInC
## 1      41     190  7.4   67     5   1 19.44444
## 2      36     118  8.0   72     5   2 22.22222
## 3      12     149 12.6   74     5   3 23.33333
## 4      18     313 11.5   62     5   4 16.66667
## 5      NA      NA 14.3   56     5   5 13.33333
## 6      28      NA 14.9   66     5   6 18.88889
## 7      23     299  8.6   65     5   7 18.33333
## 8      19      99 13.8   59     5   8 15.00000
## 9       8      19 20.1   61     5   9 16.11111
## 10     NA     194  8.6   69     5  10 20.55556
## 11      7      NA  6.9   74     5  11 23.33333
## 12     16     256  9.7   69     5  12 20.55556
## 13     11     290  9.2   66     5  13 18.88889
## 14     14     274 10.9   68     5  14 20.00000
## 15     18      65 13.2   58     5  15 14.44444
## 16     14     334 11.5   64     5  16 17.77778
## 17     34     307 12.0   66     5  17 18.88889
## 18      6      78 18.4   57     5  18 13.88889
## 19     30     322 11.5   68     5  19 20.00000
## 20     11      44  9.7   62     5  20 16.66667
## 21      1       8  9.7   59     5  21 15.00000
## 22     11     320 16.6   73     5  22 22.77778
## 23      4      25  9.7   61     5  23 16.11111
## 24     32      92 12.0   61     5  24 16.11111
## 25     NA      66 16.6   57     5  25 13.88889
## 26     NA     266 14.9   58     5  26 14.44444
## 27     NA      NA  8.0   57     5  27 13.88889
## 28     23      13 12.0   67     5  28 19.44444
## 29     45     252 14.9   81     5  29 27.22222
## 30    115     223  5.7   79     5  30 26.11111
## 31     37     279  7.4   76     5  31 24.44444
## 32     NA     286  8.6   78     6   1 25.55556
## 33     NA     287  9.7   74     6   2 23.33333
## 34     NA     242 16.1   67     6   3 19.44444
## 35     NA     186  9.2   84     6   4 28.88889
## 36     NA     220  8.6   85     6   5 29.44444
## 37     NA     264 14.3   79     6   6 26.11111
## 38     29     127  9.7   82     6   7 27.77778
## 39     NA     273  6.9   87     6   8 30.55556
## 40     71     291 13.8   90     6   9 32.22222
## 41     39     323 11.5   87     6  10 30.55556
## 42     NA     259 10.9   93     6  11 33.88889
## 43     NA     250  9.2   92     6  12 33.33333
## 44     23     148  8.0   82     6  13 27.77778
## 45     NA     332 13.8   80     6  14 26.66667
## 46     NA     322 11.5   79     6  15 26.11111
## 47     21     191 14.9   77     6  16 25.00000
## 48     37     284 20.7   72     6  17 22.22222
## 49     20      37  9.2   65     6  18 18.33333
## 50     12     120 11.5   73     6  19 22.77778
## 51     13     137 10.3   76     6  20 24.44444
## 52     NA     150  6.3   77     6  21 25.00000
## 53     NA      59  1.7   76     6  22 24.44444
## 54     NA      91  4.6   76     6  23 24.44444
## 55     NA     250  6.3   76     6  24 24.44444
## 56     NA     135  8.0   75     6  25 23.88889
## 57     NA     127  8.0   78     6  26 25.55556
## 58     NA      47 10.3   73     6  27 22.77778
## 59     NA      98 11.5   80     6  28 26.66667
## 60     NA      31 14.9   77     6  29 25.00000
## 61     NA     138  8.0   83     6  30 28.33333
## 62    135     269  4.1   84     7   1 28.88889
## 63     49     248  9.2   85     7   2 29.44444
## 64     32     236  9.2   81     7   3 27.22222
## 65     NA     101 10.9   84     7   4 28.88889
## 66     64     175  4.6   83     7   5 28.33333
## 67     40     314 10.9   83     7   6 28.33333
## 68     77     276  5.1   88     7   7 31.11111
## 69     97     267  6.3   92     7   8 33.33333
## 70     97     272  5.7   92     7   9 33.33333
## 71     85     175  7.4   89     7  10 31.66667
## 72     NA     139  8.6   82     7  11 27.77778
## 73     10     264 14.3   73     7  12 22.77778
## 74     27     175 14.9   81     7  13 27.22222
## 75     NA     291 14.9   91     7  14 32.77778
## 76      7      48 14.3   80     7  15 26.66667
## 77     48     260  6.9   81     7  16 27.22222
## 78     35     274 10.3   82     7  17 27.77778
## 79     61     285  6.3   84     7  18 28.88889
## 80     79     187  5.1   87     7  19 30.55556
## 81     63     220 11.5   85     7  20 29.44444
## 82     16       7  6.9   74     7  21 23.33333
## 83     NA     258  9.7   81     7  22 27.22222
## 84     NA     295 11.5   82     7  23 27.77778
## 85     80     294  8.6   86     7  24 30.00000
## 86    108     223  8.0   85     7  25 29.44444
## 87     20      81  8.6   82     7  26 27.77778
## 88     52      82 12.0   86     7  27 30.00000
## 89     82     213  7.4   88     7  28 31.11111
## 90     50     275  7.4   86     7  29 30.00000
## 91     64     253  7.4   83     7  30 28.33333
## 92     59     254  9.2   81     7  31 27.22222
## 93     39      83  6.9   81     8   1 27.22222
## 94      9      24 13.8   81     8   2 27.22222
## 95     16      77  7.4   82     8   3 27.77778
## 96     78      NA  6.9   86     8   4 30.00000
## 97     35      NA  7.4   85     8   5 29.44444
## 98     66      NA  4.6   87     8   6 30.55556
## 99    122     255  4.0   89     8   7 31.66667
## 100    89     229 10.3   90     8   8 32.22222
## 101   110     207  8.0   90     8   9 32.22222
## 102    NA     222  8.6   92     8  10 33.33333
## 103    NA     137 11.5   86     8  11 30.00000
## 104    44     192 11.5   86     8  12 30.00000
## 105    28     273 11.5   82     8  13 27.77778
## 106    65     157  9.7   80     8  14 26.66667
## 107    NA      64 11.5   79     8  15 26.11111
## 108    22      71 10.3   77     8  16 25.00000
## 109    59      51  6.3   79     8  17 26.11111
## 110    23     115  7.4   76     8  18 24.44444
## 111    31     244 10.9   78     8  19 25.55556
## 112    44     190 10.3   78     8  20 25.55556
## 113    21     259 15.5   77     8  21 25.00000
## 114     9      36 14.3   72     8  22 22.22222
## 115    NA     255 12.6   75     8  23 23.88889
## 116    45     212  9.7   79     8  24 26.11111
## 117   168     238  3.4   81     8  25 27.22222
## 118    73     215  8.0   86     8  26 30.00000
## 119    NA     153  5.7   88     8  27 31.11111
## 120    76     203  9.7   97     8  28 36.11111
## 121   118     225  2.3   94     8  29 34.44444
## 122    84     237  6.3   96     8  30 35.55556
## 123    85     188  6.3   94     8  31 34.44444
## 124    96     167  6.9   91     9   1 32.77778
## 125    78     197  5.1   92     9   2 33.33333
## 126    73     183  2.8   93     9   3 33.88889
## 127    91     189  4.6   93     9   4 33.88889
## 128    47      95  7.4   87     9   5 30.55556
## 129    32      92 15.5   84     9   6 28.88889
## 130    20     252 10.9   80     9   7 26.66667
## 131    23     220 10.3   78     9   8 25.55556
## 132    21     230 10.9   75     9   9 23.88889
## 133    24     259  9.7   73     9  10 22.77778
## 134    44     236 14.9   81     9  11 27.22222
## 135    21     259 15.5   76     9  12 24.44444
## 136    28     238  6.3   77     9  13 25.00000
## 137     9      24 10.9   71     9  14 21.66667
## 138    13     112 11.5   71     9  15 21.66667
## 139    46     237  6.9   78     9  16 25.55556
## 140    18     224 13.8   67     9  17 19.44444
## 141    13      27 10.3   76     9  18 24.44444
## 142    24     238 10.3   68     9  19 20.00000
## 143    16     201  8.0   82     9  20 27.77778
## 144    13     238 12.6   64     9  21 17.77778
## 145    23      14  9.2   71     9  22 21.66667
## 146    36     139 10.3   81     9  23 27.22222
## 147     7      49 10.3   69     9  24 20.55556
## 148    14      20 16.6   63     9  25 17.22222
## 149    30     193  6.9   70     9  26 21.11111
## 150    NA     145 13.2   77     9  27 25.00000
## 151    14     191 14.3   75     9  28 23.88889
## 152    18     131  8.0   76     9  29 24.44444
## 153    20     223 11.5   68     9  30 20.00000
  • summarise(): the function is used to summarise multiple values into a single value.
summarise(airquality, mean(Temp, na.rm = TRUE))
##   mean(Temp, na.rm = TRUE)
## 1                 77.88235
  • group_by(): the function is used to group data by one or more variables.
summarise(group_by(airquality, Month), mean(Temp, na.rm = TRUE))
## # A tibble: 5 × 2
##   Month `mean(Temp, na.rm = TRUE)`
##   <int>                      <dbl>
## 1     5                       65.5
## 2     6                       79.1
## 3     7                       83.9
## 4     8                       84.0
## 5     9                       76.9
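  • sample_n() and sample_frac(): these two functions select random rows from a table, either a fixed number of rows (sample_n()) or a fraction of them (sample_frac()). Since dplyr 1.0.0 they are superseded by slice_sample().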
sample_n(airquality, size = 10)
##    Ozone Solar.R Wind Temp Month Day
## 1     NA     295 11.5   82     7  23
## 2     97     272  5.7   92     7   9
## 3     27     175 14.9   81     7  13
## 4     NA     259 10.9   93     6  11
## 5     31     244 10.9   78     8  19
## 6     14      20 16.6   63     9  25
## 7     11      44  9.7   62     5  20
## 8     23     148  8.0   82     6  13
## 9    118     225  2.3   94     8  29
## 10    20      81  8.6   82     7  26
sample_frac(airquality, size = 0.1)
##    Ozone Solar.R Wind Temp Month Day
## 1     97     272  5.7   92     7   9
## 2    118     225  2.3   94     8  29
## 3     71     291 13.8   90     6   9
## 4     NA      66 16.6   57     5  25
## 5     NA     153  5.7   88     8  27
## 6     84     237  6.3   96     8  30
## 7     NA     273  6.9   87     6   8
## 8     NA     259 10.9   93     6  11
## 9     44     236 14.9   81     9  11
## 10    32      92 12.0   61     5  24
## 11    14     274 10.9   68     5  14
## 12    20     252 10.9   80     9   7
## 13    NA     332 13.8   80     6  14
## 14    11     320 16.6   73     5  22
## 15    NA     255 12.6   75     8  23
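Random sampling returns different rows on every run, which is why the two outputs above do not match each other. A minimal sketch of a repeatable draw, assuming dplyr 1.0.0 or later is installed (it uses slice_sample(), the newer counterpart of both functions); the seed value 42 is arbitrary:

```r
library(dplyr)

# Fix the random seed so the same rows are drawn on every run
set.seed(42)

# Counterpart of sample_n(airquality, size = 10)
ten_rows <- slice_sample(airquality, n = 10)

# Counterpart of sample_frac(airquality, size = 0.1)
ten_percent <- slice_sample(airquality, prop = 0.1)

nrow(ten_rows)
nrow(ten_percent)
```

Rerunning the block from set.seed() onward should reproduce the same two samples.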
  • count(): the function tallies observations based on a group.
count(airquality, Month)
##   Month  n
## 1     5 31
## 2     6 30
## 3     7 31
## 4     8 31
## 5     9 30
  • arrange(): the function is used to arrange rows by variables.
arrange(airquality, desc(Month), Day)
##     Ozone Solar.R Wind Temp Month Day
## 1      96     167  6.9   91     9   1
## 2      78     197  5.1   92     9   2
## 3      73     183  2.8   93     9   3
## 4      91     189  4.6   93     9   4
## 5      47      95  7.4   87     9   5
## 6      32      92 15.5   84     9   6
## 7      20     252 10.9   80     9   7
## 8      23     220 10.3   78     9   8
## 9      21     230 10.9   75     9   9
## 10     24     259  9.7   73     9  10
## 11     44     236 14.9   81     9  11
## 12     21     259 15.5   76     9  12
## 13     28     238  6.3   77     9  13
## 14      9      24 10.9   71     9  14
## 15     13     112 11.5   71     9  15
## 16     46     237  6.9   78     9  16
## 17     18     224 13.8   67     9  17
## 18     13      27 10.3   76     9  18
## 19     24     238 10.3   68     9  19
## 20     16     201  8.0   82     9  20
## 21     13     238 12.6   64     9  21
## 22     23      14  9.2   71     9  22
## 23     36     139 10.3   81     9  23
## 24      7      49 10.3   69     9  24
## 25     14      20 16.6   63     9  25
## 26     30     193  6.9   70     9  26
## 27     NA     145 13.2   77     9  27
## 28     14     191 14.3   75     9  28
## 29     18     131  8.0   76     9  29
## 30     20     223 11.5   68     9  30
## 31     39      83  6.9   81     8   1
## 32      9      24 13.8   81     8   2
## 33     16      77  7.4   82     8   3
## 34     78      NA  6.9   86     8   4
## 35     35      NA  7.4   85     8   5
## 36     66      NA  4.6   87     8   6
## 37    122     255  4.0   89     8   7
## 38     89     229 10.3   90     8   8
## 39    110     207  8.0   90     8   9
## 40     NA     222  8.6   92     8  10
## 41     NA     137 11.5   86     8  11
## 42     44     192 11.5   86     8  12
## 43     28     273 11.5   82     8  13
## 44     65     157  9.7   80     8  14
## 45     NA      64 11.5   79     8  15
## 46     22      71 10.3   77     8  16
## 47     59      51  6.3   79     8  17
## 48     23     115  7.4   76     8  18
## 49     31     244 10.9   78     8  19
## 50     44     190 10.3   78     8  20
## 51     21     259 15.5   77     8  21
## 52      9      36 14.3   72     8  22
## 53     NA     255 12.6   75     8  23
## 54     45     212  9.7   79     8  24
## 55    168     238  3.4   81     8  25
## 56     73     215  8.0   86     8  26
## 57     NA     153  5.7   88     8  27
## 58     76     203  9.7   97     8  28
## 59    118     225  2.3   94     8  29
## 60     84     237  6.3   96     8  30
## 61     85     188  6.3   94     8  31
## 62    135     269  4.1   84     7   1
## 63     49     248  9.2   85     7   2
## 64     32     236  9.2   81     7   3
## 65     NA     101 10.9   84     7   4
## 66     64     175  4.6   83     7   5
## 67     40     314 10.9   83     7   6
## 68     77     276  5.1   88     7   7
## 69     97     267  6.3   92     7   8
## 70     97     272  5.7   92     7   9
## 71     85     175  7.4   89     7  10
## 72     NA     139  8.6   82     7  11
## 73     10     264 14.3   73     7  12
## 74     27     175 14.9   81     7  13
## 75     NA     291 14.9   91     7  14
## 76      7      48 14.3   80     7  15
## 77     48     260  6.9   81     7  16
## 78     35     274 10.3   82     7  17
## 79     61     285  6.3   84     7  18
## 80     79     187  5.1   87     7  19
## 81     63     220 11.5   85     7  20
## 82     16       7  6.9   74     7  21
## 83     NA     258  9.7   81     7  22
## 84     NA     295 11.5   82     7  23
## 85     80     294  8.6   86     7  24
## 86    108     223  8.0   85     7  25
## 87     20      81  8.6   82     7  26
## 88     52      82 12.0   86     7  27
## 89     82     213  7.4   88     7  28
## 90     50     275  7.4   86     7  29
## 91     64     253  7.4   83     7  30
## 92     59     254  9.2   81     7  31
## 93     NA     286  8.6   78     6   1
## 94     NA     287  9.7   74     6   2
## 95     NA     242 16.1   67     6   3
## 96     NA     186  9.2   84     6   4
## 97     NA     220  8.6   85     6   5
## 98     NA     264 14.3   79     6   6
## 99     29     127  9.7   82     6   7
## 100    NA     273  6.9   87     6   8
## 101    71     291 13.8   90     6   9
## 102    39     323 11.5   87     6  10
## 103    NA     259 10.9   93     6  11
## 104    NA     250  9.2   92     6  12
## 105    23     148  8.0   82     6  13
## 106    NA     332 13.8   80     6  14
## 107    NA     322 11.5   79     6  15
## 108    21     191 14.9   77     6  16
## 109    37     284 20.7   72     6  17
## 110    20      37  9.2   65     6  18
## 111    12     120 11.5   73     6  19
## 112    13     137 10.3   76     6  20
## 113    NA     150  6.3   77     6  21
## 114    NA      59  1.7   76     6  22
## 115    NA      91  4.6   76     6  23
## 116    NA     250  6.3   76     6  24
## 117    NA     135  8.0   75     6  25
## 118    NA     127  8.0   78     6  26
## 119    NA      47 10.3   73     6  27
## 120    NA      98 11.5   80     6  28
## 121    NA      31 14.9   77     6  29
## 122    NA     138  8.0   83     6  30
## 123    41     190  7.4   67     5   1
## 124    36     118  8.0   72     5   2
## 125    12     149 12.6   74     5   3
## 126    18     313 11.5   62     5   4
## 127    NA      NA 14.3   56     5   5
## 128    28      NA 14.9   66     5   6
## 129    23     299  8.6   65     5   7
## 130    19      99 13.8   59     5   8
## 131     8      19 20.1   61     5   9
## 132    NA     194  8.6   69     5  10
## 133     7      NA  6.9   74     5  11
## 134    16     256  9.7   69     5  12
## 135    11     290  9.2   66     5  13
## 136    14     274 10.9   68     5  14
## 137    18      65 13.2   58     5  15
## 138    14     334 11.5   64     5  16
## 139    34     307 12.0   66     5  17
## 140     6      78 18.4   57     5  18
## 141    30     322 11.5   68     5  19
## 142    11      44  9.7   62     5  20
## 143     1       8  9.7   59     5  21
## 144    11     320 16.6   73     5  22
## 145     4      25  9.7   61     5  23
## 146    32      92 12.0   61     5  24
## 147    NA      66 16.6   57     5  25
## 148    NA     266 14.9   58     5  26
## 149    NA      NA  8.0   57     5  27
## 150    23      13 12.0   67     5  28
## 151    45     252 14.9   81     5  29
## 152   115     223  5.7   79     5  30
## 153    37     279  7.4   76     5  31

Now, let’s put those commands together!

airquality %>% 
    filter(Temp > 70 & Month != 5) %>% 
    group_by(Month) %>% 
    summarise(mean(Temp, na.rm = TRUE))
## # A tibble: 4 × 2
##   Month `mean(Temp, na.rm = TRUE)`
##   <int>                      <dbl>
## 1     6                       80.0
## 2     7                       83.9
## 3     8                       84.0
## 4     9                       79.9
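The summary column above takes its name from the expression itself, which is awkward to refer to later. A small sketch of the same pipeline with named summaries; the column names mean_temp and n_days are chosen here for illustration only:

```r
library(dplyr)

monthly <- airquality %>%
  filter(Temp > 70 & Month != 5) %>%
  group_by(Month) %>%
  summarise(
    mean_temp = mean(Temp, na.rm = TRUE),  # named column instead of `mean(Temp, na.rm = TRUE)`
    n_days    = n()                        # number of rows per month that passed the filter
  )

monthly
```

Named columns can then be used directly in later steps, for example arrange(monthly, desc(mean_temp)).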
  • tidyr

    tidyr is a package that makes it easy to “tidy” your data. Tidy data is data that’s easy to work with: it’s easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with R’s hundreds of modelling packages).

    • gather(data, key, value, …, na.rm = FALSE, convert = FALSE)
library(tidyr)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mtcars$car <- rownames(mtcars)
mtcars <- mtcars[, c(12, 1:11)]
head(mtcars)
##                                 car  mpg cyl disp  hp drat    wt  qsec vs am
## Mazda RX4                 Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1
## Mazda RX4 Wag         Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1
## Datsun 710               Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1
## Hornet 4 Drive       Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0
## Hornet Sportabout Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0
## Valiant                     Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0
##                   gear carb
## Mazda RX4            4    4
## Mazda RX4 Wag        4    4
## Datsun 710           4    1
## Hornet 4 Drive       3    1
## Hornet Sportabout    3    2
## Valiant              3    1
mtcarNew <- mtcars %>% gather(attribute, value, -car)
head(mtcarNew)
##                 car attribute value
## 1         Mazda RX4       mpg  21.0
## 2     Mazda RX4 Wag       mpg  21.0
## 3        Datsun 710       mpg  22.8
## 4    Hornet 4 Drive       mpg  21.4
## 5 Hornet Sportabout       mpg  18.7
## 6           Valiant       mpg  18.1
tail(mtcarNew)
##                car attribute value
## 347  Porsche 914-2      carb     2
## 348   Lotus Europa      carb     2
## 349 Ford Pantera L      carb     4
## 350   Ferrari Dino      carb     6
## 351  Maserati Bora      carb     8
## 352     Volvo 142E      carb     2
    • spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE)
mtcarSpread <- mtcarNew %>% spread(attribute, value)
head(mtcarSpread)
##                  car am carb cyl disp drat gear  hp  mpg  qsec vs    wt
## 1        AMC Javelin  0    2   8  304 3.15    3 150 15.2 17.30  0 3.435
## 2 Cadillac Fleetwood  0    4   8  472 2.93    3 205 10.4 17.98  0 5.250
## 3         Camaro Z28  0    4   8  350 3.73    3 245 13.3 15.41  0 3.840
## 4  Chrysler Imperial  0    4   8  440 3.23    3 230 14.7 17.42  0 5.345
## 5         Datsun 710  1    1   4  108 3.85    4  93 22.8 18.61  1 2.320
## 6   Dodge Challenger  0    2   8  318 2.76    3 150 15.5 16.87  0 3.520
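In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The sketch below repeats the round trip with the newer functions, assuming tidyr >= 1.0.0; the object names cars, cars_long and cars_wide are made up for this example:

```r
library(tidyr)

# Work on a fresh copy so the built-in dataset is left untouched
cars <- datasets::mtcars
cars$car <- rownames(cars)

# Wide -> long: counterpart of gather(attribute, value, -car)
cars_long <- pivot_longer(cars, cols = -car,
                          names_to = "attribute", values_to = "value")

# Long -> wide: counterpart of spread(attribute, value)
cars_wide <- pivot_wider(cars_long,
                         names_from = attribute, values_from = value)

dim(cars_long)  # 32 cars x 11 attributes = 352 rows
dim(cars_wide)
```

Row and column order differ from the gather()/spread() results: pivot_longer() keeps all attributes of one car together, and pivot_wider() keeps columns in order of first appearance instead of sorting them alphabetically.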
    • unite(data, col, ..., sep = "_", remove = TRUE)
set.seed(1)
date <- as.Date('2016-01-01') + 0:14
hour <- sample(1:24, 15)
min <- sample(1:60, 15)
second <- sample(1:60, 15)
event <- sample(letters, 15)
data <- data.frame(date, hour, min, second, event)
data
##          date hour min second event
## 1  2016-01-01    4  15     35     w
## 2  2016-01-02    7  21      6     x
## 3  2016-01-03    1  37     10     f
## 4  2016-01-04    2  41     42     g
## 5  2016-01-05   11  25     38     s
## 6  2016-01-06   14  46     47     j
## 7  2016-01-07   18  58     20     y
## 8  2016-01-08   22  54     28     n
## 9  2016-01-09    5  34     54     b
## 10 2016-01-10   16  42     44     m
## 11 2016-01-11   10  56     23     r
## 12 2016-01-12    6  44     59     t
## 13 2016-01-13   19  60     40     v
## 14 2016-01-14   23  33     51     o
## 15 2016-01-15    9  20     25     a
dataNew <- data %>%
  unite(datehour, date, hour, sep = ' ') %>%
  unite(datetime, datehour, min, second, sep = ':')
dataNew
##               datetime event
## 1   2016-01-01 4:15:35     w
## 2    2016-01-02 7:21:6     x
## 3   2016-01-03 1:37:10     f
## 4   2016-01-04 2:41:42     g
## 5  2016-01-05 11:25:38     s
## 6  2016-01-06 14:46:47     j
## 7  2016-01-07 18:58:20     y
## 8  2016-01-08 22:54:28     n
## 9   2016-01-09 5:34:54     b
## 10 2016-01-10 16:42:44     m
## 11 2016-01-11 10:56:23     r
## 12  2016-01-12 6:44:59     t
## 13 2016-01-13 19:60:40     v
## 14 2016-01-14 23:33:51     o
## 15  2016-01-15 9:20:25     a
    • separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
data1 <- dataNew %>% 
  separate(datetime, c('date', 'time'), sep = ' ') %>% 
  separate(time, c('hour', 'min', 'second'), sep = ':')
data1
##          date hour min second event
## 1  2016-01-01    4  15     35     w
## 2  2016-01-02    7  21      6     x
## 3  2016-01-03    1  37     10     f
## 4  2016-01-04    2  41     42     g
## 5  2016-01-05   11  25     38     s
## 6  2016-01-06   14  46     47     j
## 7  2016-01-07   18  58     20     y
## 8  2016-01-08   22  54     28     n
## 9  2016-01-09    5  34     54     b
## 10 2016-01-10   16  42     44     m
## 11 2016-01-11   10  56     23     r
## 12 2016-01-12    6  44     59     t
## 13 2016-01-13   19  60     40     v
## 14 2016-01-14   23  33     51     o
## 15 2016-01-15    9  20     25     a
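The united strings above are not zero-padded (for example 7:21:6), and separate() returns character columns unless convert = TRUE is set. A hedged sketch of one way to handle both, using three made-up events instead of the random data above:

```r
library(tidyr)

# Three hypothetical events, invented for illustration
events <- data.frame(
  date   = as.character(as.Date("2016-01-01") + 0:2),
  hour   = c(4L, 7L, 11L),
  min    = c(15L, 21L, 5L),
  second = c(35L, 6L, 0L)
)

# Zero-pad each time part to two digits before uniting
events$time <- sprintf("%02d:%02d:%02d", events$hour, events$min, events$second)
stamped <- unite(events[, c("date", "time")], datetime, date, time, sep = " ")
stamped$datetime

# Split on a space or a colon; convert = TRUE turns the time parts back into integers
restored <- separate(stamped, datetime,
                     into = c("date", "hour", "min", "second"),
                     sep = "[ :]", convert = TRUE)
str(restored)
```

Padding first keeps the timestamps in a form that functions such as as.POSIXct() can parse without extra work.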
  • purrr

    purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions, which allow you to replace many for loops with code that is both more succinct and easier to read. The best place to learn about the map() functions is the iteration chapter in R for Data Science.

library(purrr)
## 
## Attaching package: 'purrr'
## The following object is masked from 'package:plyr':
## 
##     compact
## The following object is masked from 'package:data.table':
## 
##     transpose
mtcars %>%
  split(.$cyl) %>% # from base R
  map(~ lm(mpg ~ wt, data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")
##         4         6         8 
## 0.5086326 0.4645102 0.4229655
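To make the "replace many for loops" claim concrete, here is a small sketch that computes the mean mpg per cylinder group twice, once with a for loop and once with map_dbl(). The variable names are invented for this example:

```r
library(purrr)

# One data frame per cylinder count, as in the example above
by_cyl <- split(datasets::mtcars, datasets::mtcars$cyl)

# for-loop version: pre-allocate the result, then fill it element by element
mean_mpg_loop <- numeric(length(by_cyl))
names(mean_mpg_loop) <- names(by_cyl)
for (group in names(by_cyl)) {
  mean_mpg_loop[group] <- mean(by_cyl[[group]]$mpg)
}

# map_dbl() version: apply the function to each element and return a numeric vector
mean_mpg_map <- map_dbl(by_cyl, function(d) mean(d$mpg))

all.equal(mean_mpg_loop, mean_mpg_map)
```

map_dbl() also checks that every result is a single number, which the loop version does not do.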
  • stringr

    stringr is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations. stringr focusses on the most important and commonly used string manipulation functions whereas stringi provides a comprehensive set covering almost anything you can imagine.

library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")
str_length(x) 
## [1] 3 5 5 5 4 9
str_c(x, collapse = ", ")
## [1] "why, video, cross, extra, deal, authority"
str_sub(x, 1, 2)
## [1] "wh" "vi" "cr" "ex" "de" "au"
str_dup(x, 2:7)
## [1] "whywhy"                                                         
## [2] "videovideovideo"                                                
## [3] "crosscrosscrosscross"                                           
## [4] "extraextraextraextraextra"                                      
## [5] "dealdealdealdealdealdeal"                                       
## [6] "authorityauthorityauthorityauthorityauthorityauthorityauthority"
str_subset(x, "[aeiou]")
## [1] "video"     "cross"     "extra"     "deal"      "authority"
str_count(x, "[aeiou]")
## [1] 0 3 1 2 2 4
str_detect(x, "[aeiou]")
## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
str_locate(x, "[aeiou]")
##      start end
## [1,]    NA  NA
## [2,]     2   2
## [3,]     3   3
## [4,]     1   1
## [5,]     2   2
## [6,]     1   1
str_extract(x, "[aeiou]")
## [1] NA  "i" "o" "e" "e" "a"
str_match(x, "(.)[aeiou](.)")
##      [,1]  [,2] [,3]
## [1,] NA    NA   NA  
## [2,] "vid" "v"  "d" 
## [3,] "ros" "r"  "s" 
## [4,] NA    NA   NA  
## [5,] "dea" "d"  "a" 
## [6,] "aut" "a"  "t"
str_replace(x, "[aeiou]", "?")
## [1] "why"       "v?deo"     "cr?ss"     "?xtra"     "d?al"      "?uthority"
str_split(c("a,b", "c,d,e"), ",")
## [[1]]
## [1] "a" "b"
## 
## [[2]]
## [1] "c" "d" "e"
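Note that str_replace() above changes only the first vowel in each string. A short sketch comparing it with str_replace_all(), which changes every match:

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

# str_replace() changes only the first match in each string
first_only <- str_replace(x, "[aeiou]", "?")

# str_replace_all() changes every match in each string
every_match <- str_replace_all(x, "[aeiou]", "?")

first_only
every_match
```

Several stringr functions come in such pairs, for example str_extract() and str_extract_all(), or str_locate() and str_locate_all().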

What’s new in R 4.1.0?

1. New pipe operator: |>

rnorm(100, mean = 4, sd = 1) |>
  density() |>
  plot()

c("Homo sapiens", "Mus musculus", "Rattus norvegicus") |> {function(i) grepl("homo", i, ignore.case = TRUE)}()
## [1]  TRUE FALSE FALSE

2. Shorthand syntax for anonymous functions: \(x)

  • Before R 4.1.0, an anonymous function passed to map() had to be written with the function keyword:
map(
  letters[2:3],
  function(x) {
    pattern <- paste0("^", x)
    grep(pattern, ls("package:datasets"), value = TRUE, ignore.case = TRUE)
  }
)
  • Since R 4.1.0, the same function can be written with the shorter \(x) form:
map(
  letters[2:3],
  \(x){
    pattern <- paste0("^", x)
    grep(pattern, ls("package:datasets"), value = TRUE, ignore.case = TRUE)
  }
)
## [[1]]
## [1] "beaver1"      "beaver2"      "BJsales"      "BJsales.lead" "BOD"         
## 
## [[2]]
## [1] "cars"        "ChickWeight" "chickwts"    "co2"         "CO2"        
## [6] "crimtab"

Updates in R 4.2.0

  • The native pipe gains an underscore placeholder, _, which passes the left-hand side to a named argument of the right-hand call
    • In R 4.1
mtcars |> (\(x) lm(hp ~ cyl, data = x))()
    • In R 4.2
mtcars |> lm(hp ~ cyl, data = _)
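The two forms should give the same model. A quick check, assuming R 4.2.0 or later (the placeholder is a syntax error in older versions); the names fit_lambda and fit_placeholder are made up here:

```r
# Requires R >= 4.2.0 for the `_` placeholder
cars <- datasets::mtcars

# R 4.1 style: wrap the call in an anonymous function
fit_lambda <- cars |> (\(x) lm(hp ~ cyl, data = x))()

# R 4.2 style: `_` marks where the left-hand side goes
fit_placeholder <- cars |> lm(hp ~ cyl, data = _)

# Both calls fit the same model, so the coefficients agree
all.equal(coef(fit_lambda), coef(fit_placeholder))
```

In R 4.2 the placeholder must be passed to a named argument (data = _ here) and may appear only once in the right-hand call.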

Drawing graph

ggplot2

ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.

The Grammar of Graphics Philosophy

ggplot2 follows the “Grammar of Graphics” approach, where plots are built layer by layer using these key components:

  1. Data: The dataset you want to visualize
  2. Aesthetics (aes): How variables map to visual properties (x, y, color, size, etc.)
  3. Geometries (geom): The visual elements (points, lines, bars, etc.)
  4. Statistics (stat): Statistical transformations of data
  5. Coordinates (coord): The coordinate system
  6. Facets: Split data into subplots
  7. Themes: Overall visual appearance
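As a rough illustration, the seven components can all appear in a single plot. The sketch below is only one possible combination, chosen to show where each component sits in the code:

```r
library(ggplot2)

p_layers <- ggplot(data = datasets::mtcars,            # 1. data
                   mapping = aes(x = wt, y = mpg)) +   # 2. aesthetics
  geom_point() +                                       # 3. geometry
  stat_smooth(method = "lm", formula = y ~ x) +        # 4. statistics
  coord_cartesian() +                                  # 5. coordinates (the default system)
  facet_wrap(~ cyl) +                                  # 6. facets
  theme_minimal()                                      # 7. theme

p_layers
```

Most plots need only the first three components; ggplot2 supplies defaults for the rest.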
Basic ggplot2 Syntax

The basic structure of a ggplot2 command:

# Install ggplot2 only if it is not already installed
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(ggplot2)

# Basic syntax
ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
[Figure: Common geom cases for one variable]
[Figure: Common geom cases for two variables]
Getting Started with Simple Examples

Let’s use the built-in mtcars dataset to learn ggplot2:

# First, let's explore our data
library(ggplot2)
head(mtcars)
##                                 car  mpg cyl disp  hp drat    wt  qsec vs am
## Mazda RX4                 Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1
## Mazda RX4 Wag         Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1
## Datsun 710               Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1
## Hornet 4 Drive       Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0
## Hornet Sportabout Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0
## Valiant                     Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0
##                   gear carb
## Mazda RX4            4    4
## Mazda RX4 Wag        4    4
## Datsun 710           4    1
## Hornet 4 Drive       3    1
## Hornet Sportabout    3    2
## Valiant              3    1
Basic Scatter Plot
# Basic scatter plot: mpg vs weight
ggplot(data = mtcars) + 
  geom_point(mapping = aes(x = wt, y = mpg))

Adding Color by Category
# Color points by number of cylinders
ggplot(data = mtcars) + 
  geom_point(mapping = aes(x = wt, y = mpg, color = factor(cyl)))

Adding Size and Shape
# Multiple aesthetic mappings
ggplot(data = mtcars) + 
  geom_point(mapping = aes(x = wt, y = mpg, 
                          color = factor(cyl), 
                          size = hp,
                          shape = factor(am)))

Common Geometric Objects (geoms)
Line Plots
# Line plot using economics dataset
ggplot(data = economics) + 
  geom_line(mapping = aes(x = date, y = unemploy))

Bar Charts
# Bar chart of car counts by cylinder
ggplot(data = mtcars) + 
  geom_bar(mapping = aes(x = factor(cyl), fill = factor(cyl)))

Histograms
# Histogram of mpg distribution
ggplot(data = mtcars) + 
  geom_histogram(mapping = aes(x = mpg), bins = 10, fill = "skyblue", color = "black")

Box Plots
# Box plot of mpg by cylinder
ggplot(data = mtcars) + 
  geom_boxplot(mapping = aes(x = factor(cyl), y = mpg, fill = factor(cyl)))

Statistical Transformations
# Adding smooth trend line
ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = TRUE)
## `geom_smooth()` using formula = 'y ~ x'

# Statistical summary
ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) + 
  geom_point(position = "jitter", alpha = 0.6) +
  stat_summary(fun = mean, geom = "point", color = "red", size = 3) +
  stat_summary(fun.data = mean_se, geom = "errorbar", color = "red", width = 0.2)

Coordinate Systems and Scales
# Coordinate transformation
ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point() + 
  coord_flip()  # Flip x and y axes

# Custom scales
ggplot(data = mtcars, aes(x = wt, y = mpg, color = hp)) + 
  geom_point(size = 3) + 
  scale_color_gradient(low = "blue", high = "red") +
  scale_x_continuous(name = "Weight (1000 lbs)") +
  scale_y_continuous(name = "Miles per Gallon")

Faceting (Multiple Panels)
# Facet wrap
ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point() + 
  facet_wrap(~ cyl, nrow = 2)

# Facet grid
ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point() + 
  facet_grid(am ~ cyl, labeller = label_both)

Customizing Themes and Appearance
# Using built-in themes
ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) + 
  geom_point(size = 3) + 
  theme_minimal()

# Custom theme modifications
ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) + 
  geom_point(size = 3) + 
  labs(title = "Car Weight vs Fuel Efficiency",
       subtitle = "Relationship between weight and MPG by cylinder count",
       x = "Weight (1000 lbs)",
       y = "Miles per Gallon",
       color = "Cylinders",
       caption = "Data source: mtcars dataset") +
  theme_classic() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "gray50"),
    legend.position = "bottom",
    panel.grid.major = element_line(color = "gray90", linewidth = 0.5)
  )

Advanced Customization Techniques
Creating Professional Publication-Ready Plots
# Professional-looking plot with multiple layers
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point(aes(color = factor(cyl), size = hp), alpha = 0.7) + 
  geom_smooth(method = "lm", se = TRUE, color = "black", linetype = "dashed") +
  scale_color_manual(values = c("4" = "#E69F00", "6" = "#56B4E9", "8" = "#CC79A7"),
                     name = "Cylinders") +
  scale_size_continuous(name = "Horsepower", range = c(2, 6)) +
  labs(
    title = "Relationship Between Car Weight and Fuel Efficiency",
    subtitle = "Data points colored by cylinder count and sized by horsepower",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon (MPG)",
    caption = "Source: Motor Trend Car Road Tests (mtcars dataset)"
  ) +
  theme_bw() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray40"),
    plot.caption = element_text(size = 9, color = "gray50"),
    legend.position = "right",
    legend.box = "vertical",
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "gray90")
  )

print(p)
## `geom_smooth()` using formula = 'y ~ x'
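A publication-ready plot usually has to be written to a file at a fixed size and resolution. A minimal sketch using ggsave(); the plot p_demo, the output path and the dimensions are placeholders chosen for this example, not values from the course:

```r
library(ggplot2)

# A small stand-in plot so the block runs on its own
p_demo <- ggplot(datasets::mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Hypothetical output location; tempfile() keeps the example out of your project folder
out_file <- tempfile(fileext = ".png")

# Width and height are in inches by default; dpi sets the resolution
ggsave(out_file, plot = p_demo, width = 6, height = 4, dpi = 300)

file.exists(out_file)
```

Passing the plot object explicitly avoids relying on whichever plot happened to be displayed last.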

Combining Multiple Plots
# Using patchwork package for combining plots (install if needed)
# install.packages("patchwork")
library(patchwork)

# Create individual plots
p1 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + 
  geom_boxplot(fill = "lightblue") + 
  labs(title = "MPG by Cylinders", x = "Cylinders", y = "MPG")

p2 <- ggplot(mtcars, aes(x = hp, y = mpg)) + 
  geom_point(color = "red") + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "MPG vs Horsepower", x = "Horsepower", y = "MPG")

p3 <- ggplot(mtcars, aes(x = mpg)) + 
  geom_histogram(bins = 10, fill = "green", alpha = 0.7) +
  labs(title = "MPG Distribution", x = "MPG", y = "Count")

# Combine plots
(p1 | p2) / p3
## `geom_smooth()` using formula = 'y ~ x'

Practical Tips for Effective Plotting
  1. Start Simple: Begin with basic plots and add complexity gradually
  2. Choose Appropriate Geoms: Match the geometry to your data type
  3. Use Color Wisely: Ensure accessibility with colorblind-friendly palettes
  4. Label Everything: Always include informative titles, axis labels, and legends
  5. Maintain Consistency: Use consistent styling across related plots
  6. Consider Your Audience: Adjust complexity based on who will view the plot
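For the colour tip, one option is the viridis scale that ships with ggplot2 (version 3.0.0 or later), which was designed to remain readable for viewers with common forms of colour blindness. A small sketch, with p_cb as a made-up object name:

```r
library(ggplot2)

p_cb <- ggplot(datasets::mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  scale_color_viridis_d(name = "Cylinders") +   # discrete viridis palette
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon")

p_cb
```

For continuous variables, scale_color_viridis_c() plays the same role.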
Useful ggplot2 Extensions
# Popular ggplot2 extension packages
install.packages(c("ggthemes", "viridis", "plotly", "gganimate"))

# Example with ggthemes
library(ggthemes)
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) + 
  geom_point(size = 3) + 
  theme_economist() +
  scale_color_economist()

Version control

Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later. For the examples in this book you will use software source code as the files being version controlled, though in reality you can do this with nearly any type of file on a computer. –from git

Installing git

Git setup

git config --global user.name "YOUR NAME"
git config --global user.email your.email@address.org
git config --global color.ui true
git config --global core.editor vim

# For Windows users
git config --global core.quotepath off

Git basics

## Initializing a repository in an existing directory
# Go to the project's directory and type
git init

# Add files you want to track
git add LICENSE
git add README.md
git commit -m 'First commit. Add LICENSE & README.md'

# Add new files
git add R.Rmd
git add helloworld.r
git commit -m 'Second commit. Add R.Rmd, helloworld.r'
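git remote add origin <REMOTE_URL>   # <REMOTE_URL> is a placeholder for your remote repository's URL
git push -u origin master            # use your own branch name if it is not master (for example, main)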

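# Discard uncommitted changes and restore the last committed state
git checkout -- filename   # restores a single file
git reset --hard           # restores every tracked file in the working tree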


## Cloning an existing repository
git clone https://github.com/godkin1211/Rcourses.git
git pull https://github.com/godkin1211/Rcourses.git

References